Large margin nearest neighbor

Large margin nearest neighbor (LMNN)[1] classification is a statistical machine learning algorithm. It learns a pseudometric designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming, a subclass of convex optimization.

The goal of supervised learning (more specifically classification) is to learn a decision rule that can categorize data instances into pre-defined classes. The k-nearest neighbor rule assumes a training data set of labeled instances (i.e. the classes are known). It classifies a new data instance with the class obtained from the majority vote of the k closest (labeled) training instances. Closeness is measured with a pre-defined metric. Large Margin Nearest Neighbors is an algorithm that learns this global (pseudo-)metric in a supervised fashion to improve the classification accuracy of the k-nearest neighbor rule.
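The k-nearest neighbor rule described above can be sketched in a few lines. This is a minimal illustration that assumes NumPy and uses the squared Euclidean distance as the pre-defined metric; the function name `knn_predict` and the toy data are hypothetical and not part of LMNN itself.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Squared Euclidean distance from x_new to every training instance.
    dists = np.sum((X_train - x_new) ** 2, axis=1)
    # Indices of the k closest (labeled) training instances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (assumed for illustration): two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([1, 1, 1, 2, 2])
label = knn_predict(X_train, y_train, np.array([0.05, 0.05]), k=3)
```

LMNN leaves this decision rule unchanged; it only replaces the distance computed in the first step with a learned one.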

Setup

The main intuition behind LMNN is to learn a pseudometric under which all data instances in the training set are surrounded by at least k instances that share the same class label. If this is achieved, the leave-one-out error (a special case of cross validation) is minimized. Let the training data consist of a data set D=\{(\vec x_1,y_1),\dots,(\vec x_n,y_n)\}\subset \mathbb{R}^d\times C, where the set of possible class categories is C=\{1,\dots,c\}.

The algorithm learns a pseudometric of the type

d(\vec x_i,\vec x_j)=(\vec x_i-\vec x_j)^\top\mathbf{M}(\vec x_i-\vec x_j).

For d(\cdot,\cdot) to be well defined (in particular, non-negative for every pair of points), the matrix \mathbf{M} needs to be positive semi-definite. The Euclidean metric (in squared form, as written above) is the special case in which \mathbf{M} is the identity matrix. This generalization is often (falsely) referred to as the Mahalanobis metric.
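As a minimal sketch of the distance defined above, assuming NumPy, the quadratic form can be evaluated directly. The helper names `lmnn_distance` and `is_psd` are hypothetical. The second example shows why the result is only a pseudometric: a positive semi-definite \mathbf{M} with a zero eigenvalue assigns distance zero to two distinct points.

```python
import numpy as np


def lmnn_distance(x_i, x_j, M):
    """Evaluate d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)


def is_psd(M, tol=1e-10):
    """Check that M is symmetric with no eigenvalue below -tol."""
    return bool(np.allclose(M, M.T) and np.linalg.eigvalsh(M).min() >= -tol)


# With M = I the form reduces to the squared Euclidean distance.
d_euclid = lmnn_distance(np.array([1.0, 2.0]), np.array([4.0, 6.0]), np.eye(2))

# A rank-deficient PSD matrix ignores the second coordinate entirely,
# so two different points can be at distance zero.
M_flat = np.diag([1.0, 0.0])
d_zero = lmnn_distance(np.array([0.0, 0.0]), np.array([0.0, 5.0]), M_flat)
```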

Figure 1 illustrates the effect of the metric under varying \mathbf{M}. The two circles show the set of points with equal distance to the center \vec x_i. In the Euclidean case this set is a circle, whereas under the modified (Mahalanobis) metric it becomes an ellipsoid.

The algorithm distinguishes between two types of special data points: target neighbors and impostors.

Target Neighbors

Target neighbors are selected before learning. Each instance \vec x_i has exactly k different target neighbors within D, which all share the same class label y_i. The target neighbors are the data points that should become nearest neighbors under the learned metric. Let us denote the set of target neighbors for a data point \vec x_i as N_i.
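The text above does not say how target neighbors are chosen. One common choice, assumed here purely for illustration, is to take the k nearest same-class points under the plain Euclidean distance. The function name `select_target_neighbors` is hypothetical.

```python
import numpy as np


def select_target_neighbors(X, y, k):
    """Return an (n, k) array whose row i holds the indices N_i.

    Assumption: N_i is the set of k same-class points closest to x_i
    under the squared Euclidean distance, fixed before learning.
    """
    n = X.shape[0]
    targets = np.empty((n, k), dtype=int)
    for i in range(n):
        # Candidates: every other point with the same label as x_i.
        same = np.flatnonzero(y == y[i])
        same = same[same != i]
        if same.size < k:
            raise ValueError(f"instance {i} has fewer than {k} same-class points")
        dists = np.sum((X[same] - X[i]) ** 2, axis=1)
        targets[i] = same[np.argsort(dists)[:k]]
    return targets


# Toy one-dimensional data (assumed): two classes of three points each.
X = np.array([[0.0], [1.0], [2.5], [10.0], [11.0], [12.5]])
y = np.array([0, 0, 0, 1, 1, 1])
N = select_target_neighbors(X, y, k=1)
```

Because the sets N_i are computed once and then held fixed, they do not change as \mathbf{M} is updated during learning.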

Impostors

An impostor of a data point \vec x_i is another data point \vec x_j with a different class label (i.e. y_i\neq y_j) which is one of the k nearest neighbors of \vec x_i. During learning the algorithm tries to minimize the number of impostors for all data instances in the training set.
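The impostor definition above translates into a short search. The sketch below assumes NumPy; the name `find_impostors` and the toy data are hypothetical. It lists every pair (i, l) in which x_l has a different label from x_i and lies among the k nearest neighbors of x_i under a given \mathbf{M}.

```python
import numpy as np


def find_impostors(X, y, M, k):
    """Return (i, l) pairs where x_l is a differently labeled k-nearest neighbor of x_i."""
    impostors = []
    for i in range(X.shape[0]):
        diffs = X - X[i]
        # d(x_i, x_n) = diffs[n]^T M diffs[n] for every n at once.
        dists = np.einsum("nd,de,ne->n", diffs, M, diffs)
        dists[i] = np.inf  # a point is not its own neighbor
        nearest = np.argsort(dists)[:k]
        impostors.extend((i, int(l)) for l in nearest if y[l] != y[i])
    return impostors


# Toy data (assumed): class 0 lies on the line y=0, class 1 on the line y=1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.5, 1.0]])
y = np.array([0, 0, 1, 1])

under_euclidean = find_impostors(X, y, np.eye(2), k=1)
# Stretching the second axis pushes the two classes apart.
under_stretched = find_impostors(X, y, np.diag([1.0, 10.0]), k=1)
```

In this toy example every point has an impostor under the Euclidean metric, while the stretched metric removes all of them. This is the kind of change LMNN tries to learn from data.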

Algorithm

Large Margin Nearest Neighbors optimizes the matrix \mathbf{M} with the help of semidefinite programming. The objective is twofold: for every data point \vec x_i, the target neighbors should be close and the impostors should be far away. Figure 1 shows the effect of such an optimization on an illustrative example. The learned metric causes the input vector \vec x_i to be surrounded by training instances of the same class. If it were a test point, it would be classified correctly under the k=3 nearest neighbor rule.

The first optimization goal is achieved by minimizing the average distance between instances and their target neighbors

\sum_{i,j\in N_i} d(\vec x_i,\vec x_j).

The second goal is achieved by constraining impostors \vec x_l to be one unit further away than target neighbors \vec x_j (and therefore pushing them out of the local neighborhood of \vec x_i). The resulting inequality constraint can be stated as:

\forall_{i,j \in N_i,l, y_l\neq y_i} d(\vec x_i,\vec x_j)+1\leq d(\vec x_i,\vec x_l)

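The margin of exactly one unit fixes the scale of the matrix \mathbf{M}. Because both the distances and the constraints are linear in \mathbf{M}, any alternative margin c>0 would only rescale the solution: the optimal matrix for margin c is c times the optimal matrix for unit margin.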

The final optimization problem becomes:

\min_{\mathbf{M}} \sum_{i,j\in N_i} d(\vec x_i,\vec x_j) + \sum_{i,j,l} \xi_{ijl}
\forall_{i,j \in N_i,l, y_l\neq y_i}
d(\vec x_i,\vec x_j)+1\leq d(\vec x_i,\vec x_l)+\xi_{ijl}
\xi_{ijl}\geq 0
\mathbf{M}\succeq 0

Here the slack variables \xi_{ijl} absorb the amount of violations of the impostor constraints. Their overall sum is minimized. The last constraint ensures that \mathbf{M} is positive semi-definite. The optimization problem is an instance of semidefinite programming (SDP). Although SDPs tend to suffer from high computational complexity, this particular SDP instance can be solved very efficiently due to the underlying geometric properties of the problem. In particular, most impostor constraints are naturally satisfied and do not need to be enforced during runtime. A particularly well suited solver technique is the working set method, which keeps a small set of constraints that are actively enforced and monitors the remaining (likely satisfied) constraints only occasionally to ensure correctness.
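For a fixed \mathbf{M}, the smallest feasible slack is \xi_{ijl}=\max(0,\,1+d(\vec x_i,\vec x_j)-d(\vec x_i,\vec x_l)), so the objective can be evaluated as a sum of hinge terms. The sketch below assumes NumPy and uses hypothetical names (`lmnn_objective`, `project_psd`). It evaluates the unweighted objective exactly as stated above and shows one standard way to restore the constraint \mathbf{M}\succeq 0, by clipping negative eigenvalues. It is not the authors' solver and omits the working set bookkeeping.

```python
import numpy as np


def pair_dist(X, M, i, j):
    """d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)."""
    diff = X[i] - X[j]
    return float(diff @ M @ diff)


def lmnn_objective(X, y, M, targets):
    """Sum of target-neighbor distances plus the minimal slack for each triple."""
    pull = 0.0  # first term: distances to target neighbors
    push = 0.0  # second term: sum of slack variables xi_ijl
    for i in range(X.shape[0]):
        others = np.flatnonzero(y != y[i])  # all l with y_l != y_i
        for j in targets[i]:
            d_ij = pair_dist(X, M, i, j)
            pull += d_ij
            for l in others:
                # Minimal slack satisfying d_ij + 1 <= d_il + xi, xi >= 0.
                push += max(0.0, 1.0 + d_ij - pair_dist(X, M, i, l))
    return pull + push


def project_psd(M):
    """Project a square matrix onto the PSD cone by clipping negative eigenvalues."""
    M = (M + M.T) / 2.0
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None)) @ V.T


# Toy data (assumed): two classes, one target neighbor per point.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.5, 1.0]])
y = np.array([0, 0, 1, 1])
targets = np.array([[1], [0], [3], [2]])

obj_identity = lmnn_objective(X, y, np.eye(2), targets)
obj_stretched = lmnn_objective(X, y, np.diag([1.0, 10.0]), targets)
```

Only triples with a positive hinge term contribute to the second sum. In the stretched example none do, which mirrors the observation that most impostor constraints are satisfied without being enforced.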

Extensions and efficient solvers

LMNN was extended to multiple local metrics in a 2008 paper.[2] This extension significantly reduces the classification error, but involves a more expensive optimization problem. In their 2009 publication in the Journal of Machine Learning Research,[3] Weinberger and Saul derive an efficient solver for the semi-definite program. It can learn a metric for the MNIST handwritten digit data set in several hours, involving billions of pairwise constraints. An open source Matlab implementation is freely available at the authors' web page.

Torresani and Lee[4] use the kernel trick to indirectly incorporate non-linear feature transformations and solve LMNN in an inner product space. Kumar et al.[5] extended the algorithm to incorporate local invariances to multivariate polynomial transformations and improved regularization.

References

  1. ^ Weinberger, K. Q.; Blitzer, J. C.; Saul, L. K. (2006). "Distance Metric Learning for Large Margin Nearest Neighbor Classification". Advances in Neural Information Processing Systems 18 (NIPS): 1473–1480. http://books.nips.cc/papers/files/nips18/NIPS2005_0265.pdf.
  2. ^ Weinberger, K. Q.; Saul, L. K. (2008). "Fast solvers and efficient implementations for distance metric learning". Proceedings of International Conference on Machine Learning: 1160–1167. http://research.yahoo.net/files/icml2008a.pdf.
  3. ^ Weinberger, K. Q.; Saul, L. K. (2009). "Distance Metric Learning for Large Margin Classification". Journal of Machine Learning Research 10: 207–244. http://www.jmlr.org/papers/volume10/weinberger09a/weinberger09a.pdf.
  4. ^ Torresani, L.; Lee, K. (2007). "Large Margin Component Analysis". Advances in Neural Information Processing Systems 19: 1385–1392. http://books.nips.cc/papers/files/nips19/NIPS2006_0791.pdf.
  5. ^ Kumar, M. P.; Torr, P. H. S.; Zisserman, A. (2007). "An invariant large margin nearest neighbour classifier". IEEE 11th International Conference on Computer Vision (ICCV), 2007: 1–8. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4409041.